Smoothing a Lexicon-based POS Tagger for Arabic and Hebrew
Abstract
We propose an enhanced Part-of-Speech (POS) tagger for Semitic languages that treats Modern Standard Arabic (henceforth Arabic) and Modern Hebrew (henceforth Hebrew) using the same probabilistic model and architectural setting. We start out by porting an existing Hidden Markov Model POS tagger for Hebrew to Arabic, exchanging the morphological analyzer for Hebrew with Buckwalter's (2002) morphological analyzer for Arabic. This gives state-of-the-art accuracy (96.12%), comparable to Habash and Rambow’s (2005) analyzer-based POS tagger on the same Arabic datasets. However, further improvement of such analyzer-based tagging methods is hindered by the incomplete coverage of standard morphological analyzers (Bar Haim et al., 2005). To overcome this coverage problem, we supplement the output of Buckwalter's analyzer with synthetically constructed analyses proposed by a model that uses character information (Diab et al., 2004), in a way that is similar to Nakagawa's (2004) system for Chinese and Japanese. A version of this extended model that (unlike Nakagawa's) also incorporates synthetically constructed analyses for known words achieves 96.28% accuracy on the standard Arabic test set.
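The idea of restricting an HMM tagger to analyzer-proposed tags and supplementing them with character-based guesses can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy English lexicon stands in for the morphological analyzer, the suffix table stands in for the character-information model, and the probabilities, the `floor` value, and the `smooth_known` flag are hypothetical. It is not the model described in the abstract.

```python
import math

# Toy stand-in for a morphological analyzer (hypothetical entries).
LEXICON = {
    "the": {"DET"},
    "dog": {"NOUN"},
    "barks": {"VERB", "NOUN"},
    "runs": {"VERB"},
}

# Toy stand-in for a character-based proposal model (hypothetical).
SUFFIX_TAGS = {"s": {"VERB", "NOUN"}, "ly": {"ADV"}}


def candidate_tags(word, smooth_known=False):
    """Return the tags the tagger may choose from for this word.

    Analyzer tags are used when available. Suffix-based guesses are added
    for unknown words, or for every word when smooth_known is True.
    """
    tags = set(LEXICON.get(word, ()))
    if not tags or smooth_known:
        for suffix, guessed in SUFFIX_TAGS.items():
            if word.endswith(suffix):
                tags |= guessed
    return tags or {"NOUN"}  # last-resort default (an assumption)


def viterbi(words, trans, emit, smooth_known=False, floor=1e-6):
    """Bigram-HMM Viterbi search restricted to candidate_tags per word.

    trans maps (previous_tag, tag) to a probability and emit maps
    (tag, word) to a probability; unseen events get the floor value.
    """
    best = {"<s>": (0.0, [])}  # tag -> (log-probability, best path so far)
    for word in words:
        step = {}
        for tag in sorted(candidate_tags(word, smooth_known)):
            log_emit = math.log(emit.get((tag, word), floor))
            step[tag] = max(
                (
                    (score + math.log(trans.get((prev, tag), floor)) + log_emit,
                     path + [tag])
                    for prev, (score, path) in best.items()
                ),
                key=lambda item: item[0],
            )
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


TRANS = {
    ("<s>", "DET"): 0.8,
    ("DET", "NOUN"): 0.9,
    ("NOUN", "VERB"): 0.7,
    ("NOUN", "NOUN"): 0.1,
    ("VERB", "ADV"): 0.5,
}
EMIT = {
    ("DET", "the"): 1.0,
    ("NOUN", "dog"): 0.5,
    ("VERB", "barks"): 0.3,
    ("NOUN", "barks"): 0.05,
}

# "loudly" is missing from the lexicon, so only the suffix guess applies.
print(viterbi(["the", "dog", "barks", "loudly"], TRANS, EMIT))
# → ['DET', 'NOUN', 'VERB', 'ADV']
```

With `smooth_known=True`, a lexicon word such as the hypothetical entry `"runs"` gains the suffix-proposed `NOUN` tag in addition to its analyzer tag `VERB`, which loosely mirrors the variant that adds synthetic analyses for known words.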
Similar Resources
Probabilistic Arabic Part of Speech Tagger with Unknown Words Handling
Part-of-Speech (POS) tagging is an essential preprocessing step in many natural language applications. In this paper, we investigate the best configuration of a trigram Hidden Markov Model (HMM) Arabic POS tagger when only a small tagged corpus is available. With small training data, unknown word POS guessing is the main problem. This problem becomes more serious in languages which have huge size of voca...
Part-of-speech tagging of Modern Hebrew text
Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a Part-of-Speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If...
NeoTag: a POS Tagger for Grammatical Neologism Detection
POS Taggers typically fail to correctly tag grammatical neologisms: for a known word, most taggers will only take known tags into account, and hence discard the possibility that that word is used in a novel or deviant grammatical category in a new text. Grammatical neologisms are relatively rare, and therefore do not pose a significant problem for the overall performance of a tagger. But for st...
Rule Based Approach for Arabic Part of Speech Tagging and Name Entity Recognition
The aim of this study is to build a tool for Part of Speech (POS) tagging and Named Entity Recognition for the Arabic language; the approach used to build this tool is a rule-based technique. The POS tagger contains two phases: the first passes the word into a lexicon phase, the second is the morphological phase, and the tagset is (Noun, Verb and Determine). The Named-Entity detector will...
Fast Development of Basic NLP Tools: Towards a Lexicon and a POS Tagger for Kurmanji Kurdish
The development of basic NLP resources for minority languages is still a challenge to both formal and computational linguists. In this paper, we show how we were able to develop a medium-scale morphological lexicon for Kurmanji Kurdish in a few days' time using only freely accessible resources. We also developed a preliminary POS tagger that shall be used as a pre-annotation tool for developing ...